Author: Y.X. Wu
This notebook conducts a comparative analysis between our new data set and an existing literature data set on two properties: hardness and corrosion. Both data sets share a defined compositional feature space; the new data set, however, was generated in a different regime of that space.
The analysis proceeds in three parts:
Principal Component Analysis (PCA): We'll use PCA as a technique to reduce the dimensionality of the feature space. It will allow us to effectively visualize the new data alongside the literature data within the same reduced feature space. PCA does this by transforming the original variables into a new set of uncorrelated variables called principal components, which are ordered by the amount of variance they can explain from the original data. This allows us to capture most of the information in the original data with fewer dimensions.
Uniform Manifold Approximation and Projection (UMAP): To complement the dimensionality reduction done by PCA, we also employ UMAP, which is a nonlinear dimensionality reduction method. It excels in preserving the structure of high-dimensional data in low-dimensional space, making it suitable for visualizing clusters or groups within data. Unlike PCA, UMAP can capture nonlinear relationships within the data, making it more capable of separating different classes or clusters. Our UMAP application involves a Mahalanobis distance metric, which respects the covariance of the data, and the implementation scales the data and calculates the inverse covariance matrix as part of its workflow. This technique will further aid in visualizing and understanding the structure of our compositional feature space.
Mahalanobis Distance Calculation: Following the visualization, we take a more quantitative look at the data by calculating the Mahalanobis distance of each new data point from the centroid of the literature data set. The Mahalanobis distance measures the distance between a point and a distribution, rather than between two points; it generalizes the Euclidean distance to correlated, multivariate data. Unlike the Euclidean distance, it is scale-invariant and accounts for the correlations of the data set. By calculating the Mahalanobis distance, we can quantify how much the new data deviates from the distribution of the literature data.
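As a concrete illustration of the scale-invariance claim above, here is a minimal sketch (on synthetic data, not the notebook's data sets) that computes the squared Mahalanobis distance of a point from a sample's centroid and checks that it is unchanged when the data and the point are rescaled together:

```python
import numpy as np

# Synthetic correlated 2-D sample (illustration only)
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
point = np.array([1.0, 1.0])

def mahal_sq(x, sample):
    """Squared Mahalanobis distance of x from the centroid of sample."""
    diff = x - sample.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(sample.T))
    return diff @ inv_cov @ diff

d2 = mahal_sq(point, data)
d2_scaled = mahal_sq(point * 10, data * 10)  # rescale data and point together
print(f"D^2 = {d2:.6f}, after x10 rescale: {d2_scaled:.6f}")
```

A plain Euclidean distance would grow by a factor of 10 under the same rescaling; the Mahalanobis distance does not, because the covariance rescales along with the data.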
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import numpy as np
import math
import matplotlib.cm as cm
import matplotlib as mpl
import os
# display the current working directory
display(f"Current working directory: {os.getcwd()}")
data_path = '../Dataset_Cleaned/'
display(os.path.isfile(data_path+'LiteratureDataset_Hardness_YW_v3_processed.xlsx'))
'Current working directory: /nethome/home3/yuxiang.wu/CCA-representation-ML/Dataset_EDA_Feature_UMAP_Mahalanobis'
True
# Declare column names for the chemical composition dataframe, specific testing conditions, selected features, and output for Hardness and Corrosion datasets.
compo_column = ['Fe', 'Cr', 'Ni', 'Mo', 'W', 'N', 'Nb', 'C', 'Si',
'Mn', 'Cu', 'Al', 'V', 'Ta', 'Ti', 'Co', 'Mg', 'Y', 'Zr', 'Hf']
C_specific_testing_column = ['TestTemperature_C', 'ChlorideIonConcentration', 'pH', 'ScanRate_mVs']
specific_features_sel_column = ['delta_a', 'Tm', 'sigma_Tm', 'Hmix', 'sigma_Hmix', 'sigma_elec_nega', 'VEC', 'sigma_VEC']
H_output_column = ['converted HV']
C_output_column = ['AvgPittingPotential_mV']
# Load the Hardness and Corrosion datasets
df_H = pd.read_excel(data_path + 'LiteratureDataset_Hardness_YW_v3_processed.xlsx')
df_C = pd.read_excel(data_path + 'LiteratureDataset_Corrosion_YW_v3_processed.xlsx')
# Partition the datasets into component composition, specific features, and output data
# (.copy() on the composition slices so the 'dataset' column assigned later does not
# trigger pandas' SettingWithCopyWarning)
df_H_compo, df_H_specific_features, df_H_output = df_H[compo_column].copy(), df_H[specific_features_sel_column], df_H[H_output_column]
(df_C_compo, df_C_specific_testing,
 df_C_specific_features, df_C_output) = df_C[compo_column].copy(), df_C[C_specific_testing_column], df_C[specific_features_sel_column], df_C[C_output_column]
df_H_compo_specific_features = pd.concat([df_H_compo, df_H_specific_features], axis=1)
df_C_compo_specific_features = pd.concat([df_C_compo, df_C_specific_features], axis=1)
# Specify columns for NiCrCoVFe and NiCrMoTiFe composition dataframes
NiCrCoVFe_compo_column = ['Ni', 'Cr', 'Co', 'V', 'Fe']
NiCrMoTiFe_compo_column = ['Ni', 'Cr', 'Mo', 'Ti', 'Fe']
# Load NiCrCoVFe and NiCrMoTiFe datasets
df_NiCrCoVFe = pd.read_excel(data_path + 'MultiTaskModel_NiCrCoVFe_KW99_wt_pct_processed.xlsx')
df_NiCrMoTiFe = pd.read_excel(data_path + 'MultiTaskModel_NiCrMoTiFe_KW131_wt_pct_processed.xlsx')
# Extract composition and specific feature data from each dataset
df_NiCrCoVFe_compo, df_NiCrCoVFe_specific_features = df_NiCrCoVFe[NiCrCoVFe_compo_column], df_NiCrCoVFe[specific_features_sel_column]
df_NiCrMoTiFe_compo, df_NiCrMoTiFe_specific_features = df_NiCrMoTiFe[NiCrMoTiFe_compo_column], df_NiCrMoTiFe[specific_features_sel_column]
# Create a base dataframe for composition data with required columns
df_compo = pd.DataFrame(columns=compo_column)
# Merge base composition dataframe with each dataset's composition, filling missing values with 0
df_NiCrCoVFe_compo = pd.concat([df_compo, df_NiCrCoVFe_compo], axis=0).fillna(0)
df_NiCrMoTiFe_compo = pd.concat([df_compo, df_NiCrMoTiFe_compo], axis=0).fillna(0)
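The concat-with-an-empty-base-frame-plus-`fillna` pattern above works; `DataFrame.reindex` expresses the same alignment more directly and also fixes the column order. A minimal sketch with a made-up sub-system composition and an abbreviated element list:

```python
import pandas as pd

compo_cols = ['Fe', 'Cr', 'Ni', 'Mo', 'Co', 'V', 'Ti']  # abbreviated element list (illustration)
df_sub = pd.DataFrame({'Ni': [51.99], 'Cr': [31.29], 'Fe': [10.27]})  # subset composition

# Align the subset to the full composition space; absent elements are filled with 0
df_full = df_sub.reindex(columns=compo_cols, fill_value=0)
print(df_full)
```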
# Combine composition and specific feature data for each dataset
df_NiCrCoVFe_compo_specific_features = pd.concat([df_NiCrCoVFe_compo, df_NiCrCoVFe_specific_features], axis=1)
df_NiCrMoTiFe_compo_specific_features = pd.concat([df_NiCrMoTiFe_compo, df_NiCrMoTiFe_specific_features], axis=1)
# Display the first row of each combined dataframe for verification
display(df_NiCrCoVFe_compo_specific_features.head(1))
display(df_NiCrMoTiFe_compo_specific_features.head(1))
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Zr | Hf | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.271486 | 31.290494 | 51.985556 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0.007568 | 1898.775789 | 210.082957 | -6.271299 | 2.111357 | 0.115683 | 8.311398 | 1.854884 |
1 rows × 28 columns
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Zr | Hf | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.316994 | 35.210247 | 45.453038 | 4.138939 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0.022535 | 1943.689937 | 257.662231 | -6.600671 | 2.725195 | 0.129477 | 8.019565 | 1.882783 |
1 rows × 28 columns
# Add a 'dataset' label column to the Corrosion, Hardness, and two new dataframes
# (loop variables renamed so they do not shadow the base df_compo defined above)
for df_c, df_cf, label in zip(
        [df_C_compo, df_H_compo, df_NiCrMoTiFe_compo, df_NiCrCoVFe_compo],
        [df_C_compo_specific_features, df_H_compo_specific_features, df_NiCrMoTiFe_compo_specific_features, df_NiCrCoVFe_compo_specific_features],
        ['corrosion data', 'hardness data', 'new NiCrMoTiFe data', 'new NiCrCoVFe data']):
    df_c['dataset'] = label
    df_cf['dataset'] = label
# Combine Corrosion and Hardness composition data into a single dataframe
df_compo_conc = pd.concat([df_C_compo, df_H_compo], ignore_index=True)
# df_compo_conc.to_excel('pairplot_corrosion_hardness_datasets.xlsx', index=False)
display(df_compo_conc.iloc[[0, -1]], df_compo_conc.shape)
# Add the new datasets to the combined composition dataframe
df_compo_conc_new = pd.concat([df_compo_conc, df_NiCrMoTiFe_compo, df_NiCrCoVFe_compo], ignore_index=True)
display(df_compo_conc_new.iloc[[0, 712, -70, -1]], df_compo_conc_new.shape)
# Combine Corrosion and Hardness composition with specific features into a single dataframe
df_compo_specific_features_conc = pd.concat([df_C_compo_specific_features, df_H_compo_specific_features], ignore_index=True)
display(df_compo_specific_features_conc.iloc[[0, -1]], df_compo_specific_features_conc.shape)
# Add the new datasets to the combined composition with specific features dataframe
df_compo_specific_features_conc_new = pd.concat([df_compo_specific_features_conc, df_NiCrMoTiFe_compo_specific_features, df_NiCrCoVFe_compo_specific_features], ignore_index=True)
display(df_compo_specific_features_conc_new.iloc[[0, 712, -70, -1]], df_compo_specific_features_conc_new.shape)
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Al | V | Ta | Ti | Co | Mg | Y | Zr | Hf | dataset | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.77 | 18.0 | 10.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.03 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | corrosion data |
| 1391 | 20.17 | 0.0 | 42.4 | 6.93 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | ... | 0.0 | 9.2 | 0.0 | 0.0 | 21.29 | 0.0 | 0.0 | 0.0 | 0.0 | hardness data |
2 rows × 21 columns
(1392, 21)
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Al | V | Ta | Ti | Co | Mg | Y | Zr | Hf | dataset | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.770000 | 18.000000 | 10.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.03 | 1.0 | 1.00 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | corrosion data |
| 712 | 19.920000 | 18.540000 | 20.930000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 19.59 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 21.02000 | 0.0 | 0.0 | 0.0 | 0.0 | hardness data |
| 1460 | 63.370312 | 5.076982 | 23.439965 | 4.034033 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 0.000000 | 0.0 | 4.078709 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | new NiCrMoTiFe data |
| 1529 | 51.347188 | 5.827793 | 31.584854 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 8.142295 | 0.0 | 0.000000 | 3.09787 | 0.0 | 0.0 | 0.0 | 0.0 | new NiCrCoVFe data |
4 rows × 21 columns
(1530, 21)
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Hf | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | dataset | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.77 | 18.0 | 10.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.03 | 1.0 | 1.0 | ... | 0.0 | 0.019485 | 1869.874935 | 173.330067 | -4.484751 | 4.974488 | 0.083927 | 7.719083 | 1.188494 | corrosion data |
| 1391 | 20.17 | 0.0 | 42.4 | 6.93 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | ... | 0.0 | 0.024620 | 1852.264411 | 257.734769 | -6.826400 | 3.890390 | 0.103641 | 8.659625 | 1.614745 | hardness data |
2 rows × 29 columns
(1392, 29)
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Hf | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | dataset | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.770000 | 18.000000 | 10.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.03 | 1.0 | 1.00 | ... | 0.0 | 0.019485 | 1869.874935 | 173.330067 | -4.484751 | 4.974488 | 0.083927 | 7.719083 | 1.188494 | corrosion data |
| 712 | 19.920000 | 18.540000 | 20.930000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 19.59 | ... | 0.0 | 0.032693 | 1801.194406 | 214.312458 | -4.159700 | 2.197442 | 0.138357 | 8.000092 | 1.414148 | hardness data |
| 1460 | 63.370312 | 5.076982 | 23.439965 | 4.034033 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 0.039801 | 1844.869218 | 192.175144 | -5.701689 | 4.064609 | 0.098628 | 8.101491 | 1.410874 | new NiCrMoTiFe data |
| 1529 | 51.347188 | 5.827793 | 31.584854 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 0.016731 | 1841.240483 | 148.982134 | -5.437576 | 2.782294 | 0.086038 | 8.238582 | 1.496411 | new NiCrCoVFe data |
4 rows × 29 columns
(1530, 29)
Composition space
sns.set_context("notebook", font_scale=2)
# Define color palette for the pairplot
palette = ["firebrick", "dodgerblue", "green", "darkorange"]
# Create pairplot with KDE for all data
grid_kde = sns.pairplot(df_compo_conc_new, vars=['Fe', 'Cr', 'Ni', 'Mo', 'Ti', 'Co', 'V'],
hue="dataset", kind="kde", corner=True, palette=palette)
# Adjust x and y limits on each populated subplot
for ax_row in grid_kde.axes:
    for ax in ax_row:
        if ax is not None:
            ax.set_xlim(left=0)
            ax.set_ylim(bottom=0)
plt.savefig('pairplot_compo_section_literature+new.png', bbox_inches='tight')
# Show the plots
plt.show()
Composition + engineered feature space
sns.set_context("notebook", font_scale=2)
# Define color palette for the pairplot
palette = ["firebrick", "dodgerblue", "green", "darkorange"]
# Create pairplot with KDE for all data
grid_kde = sns.pairplot(df_compo_specific_features_conc_new, vars=specific_features_sel_column,
hue="dataset", kind="kde", corner=True, palette=palette)
# Adjust x and y limits and axis-label sizes on each populated subplot
for ax_row in grid_kde.axes:
    for ax in ax_row:
        if ax is not None:
            ax.set_xlim(left=0)
            ax.set_ylim(bottom=0)
            ax.xaxis.label.set_size(20)
            ax.yaxis.label.set_size(20)
plt.savefig('pairplot_feature_literature+new.png', bbox_inches='tight')
# Show the plots
plt.show()
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
def perform_pca(df, dim_name1='Principal Component 1', dim_name2='Principal Component 2'):
df_pca = df.copy()
# Scale the data
X_conc = MinMaxScaler().fit_transform(df_pca.drop(columns='dataset').values)
y_conc = df_pca['dataset'].values
# Perform PCA
pca = PCA()
X_conc_r = pca.fit_transform(X_conc)
X_conc_r = X_conc_r[:, :2]
# Add the PCA components to the dataframe
df_pca[dim_name1] = X_conc_r[:, 0]
df_pca[dim_name2] = X_conc_r[:, 1]
# Print explained variance ratio
print(f"Explained variance ratio (first two components): {pca.explained_variance_ratio_[0:2]}")
return df_pca
def plot_data(df_pca, title='PCA',
              dim_name1='Principal Component 1', dim_name2='Principal Component 2',
              axis_lim=[-1.3, 1.3, -1.3, 1.3], threshold=0.1, levels=5):
# Set up the plot
colors = ["firebrick", "dodgerblue", "green", "darkorange"]
# Create a joint plot with KDE contour
g = sns.jointplot(data=df_pca, x=dim_name1, y=dim_name2, hue='dataset', palette=colors, alpha=0.5)
g.plot_joint(sns.kdeplot, zorder=0, palette=colors, levels=levels, linewidths=1, alpha=0.75, thresh=threshold)
# Customize the plot
g.ax_joint.grid(linewidth=0.1)
g.ax_joint.set_xlabel(dim_name1, fontsize=15)
g.ax_joint.set_ylabel(dim_name2, fontsize=15)
g.ax_joint.tick_params(labelsize=15)
# g.ax_joint.set_aspect('equal', 'box')
g.ax_joint.set_xlim(axis_lim[0], axis_lim[1])
g.ax_joint.set_ylim(axis_lim[2], axis_lim[3])
# g.ax_joint.set_title(title, fontsize=20, y=1.5)
g.ax_joint.legend(loc='upper right', fontsize=15, bbox_to_anchor=(2.5, 1))
# Set the same limits for the marginal plots
g.ax_marg_x.set_xlim(axis_lim[0], axis_lim[1])
g.ax_marg_y.set_ylim(axis_lim[2], axis_lim[3])
# Save and show the plot
plt.savefig(title + '.png', bbox_inches='tight')
plt.show()
# Perform PCA on df_compo_conc_new
df_compo_conc_new_pca = perform_pca(df_compo_conc_new, dim_name1='Principal Component 1', dim_name2='Principal Component 2')
# Perform PCA on df_compo_specific_features_conc_new
df_compo_specific_features_conc_new_pca = perform_pca(df_compo_specific_features_conc_new,
                                                      dim_name1='Principal Component 1',
                                                      dim_name2='Principal Component 2')
# Plot data for df_compo_conc_new_pca
plot_data(df_compo_conc_new_pca, title='PCA 2D_Compositional Space',
          dim_name1='Principal Component 1', dim_name2='Principal Component 2',
          axis_lim=[-1, 1.3, -1, 1.05], threshold=0.02, levels=8)
# Plot data for df_compo_specific_features_conc_new_pca
plot_data(df_compo_specific_features_conc_new_pca, title='PCA 2D_Compositional and Engineered Feature Space',
          dim_name1='Principal Component 1', dim_name2='Principal Component 2',
          axis_lim=[-1, 1.3, -1, 1.05], threshold=0.02, levels=8)
Explained variance ratio (first two components): [0.24339372 0.18287113]
Explained variance ratio (first two components): [0.23460085 0.19482752]
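Since the first two components capture only ~40–43% of the variance here, a scree-style check of the cumulative explained variance shows how many components a target coverage would require. A sketch on synthetic low-rank data (the same pattern applies to the scaled dataframes above):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with ~3 dominant directions plus noise (illustration only)
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 3))
mix = rng.normal(size=(3, 10))
X = latent @ mix + 0.1 * rng.normal(size=(300, 10))

pca = PCA().fit(MinMaxScaler().fit_transform(X))
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_for_90 = int(np.argmax(cum_var >= 0.90) + 1)  # smallest k with >= 90% coverage
print(f"cumulative variance: {np.round(cum_var, 3)}")
print(f"components needed for 90%: {n_for_90}")
```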
from umap import UMAP
from sklearn.covariance import EmpiricalCovariance
from scipy.linalg import inv
import numpy as np
import random
np.random.seed(0)
random.seed(0)
def perform_umap(df, n_neighbors=30, min_dist=0.6, n_components=2, metric='mahalanobis',
dim_name1='UMAP Component 1', dim_name2='UMAP Component 2', random_state=42):
df_umap = df.copy()
# Scale the data
X_conc = MinMaxScaler().fit_transform(df_umap.drop(columns='dataset').values)
y_conc = df_umap['dataset'].values
# Calculate inverse covariance matrix
cov = EmpiricalCovariance().fit(X_conc)
inv_cov_matrix = inv(cov.covariance_)
# Perform UMAP
umap = UMAP(n_neighbors=n_neighbors, min_dist=min_dist, n_components=n_components,
metric=metric, metric_kwds={'VI': inv_cov_matrix}, random_state=random_state)
X_conc_r = umap.fit_transform(X_conc)
# Add the UMAP components to the dataframe
df_umap[dim_name1] = X_conc_r[:, 0]
df_umap[dim_name2] = X_conc_r[:, 1]
return df_umap
# Perform UMAP on df_compo_conc_new
df_compo_conc_new_umap = perform_umap(df_compo_conc_new, dim_name1='UMAP Component 1', dim_name2='UMAP Component 2')
# Perform UMAP on df_compo_specific_features_conc_new
df_compo_specific_features_conc_new_umap = perform_umap(df_compo_specific_features_conc_new,
dim_name1='UMAP Component 1',
dim_name2='UMAP Component 2')
# Plot data for df_compo_conc_new_umap
plot_data(df_compo_conc_new_umap, title='UMAP 2D_Compositional Space',
dim_name1='UMAP Component 1', dim_name2='UMAP Component 2',
axis_lim=[-30, 40, -20, 40], threshold=0.001, levels=30)
# Plot data for df_compo_specific_features_conc_new_umap
plot_data(df_compo_specific_features_conc_new_umap, title='UMAP 2D_Compositional and Engineered Feature Space',
dim_name1='UMAP Component 1', dim_name2='UMAP Component 2',
axis_lim=[-30, 40, -20, 40], threshold=0.001, levels=30)
import scipy.stats as stats
from scipy.stats import chi2
import numpy as np
Here we use the composition + engineered feature space for the subsequent Mahalanobis distance analysis.
df_C_umap = df_compo_specific_features_conc_new_umap[df_compo_specific_features_conc_new_umap['dataset'] == 'corrosion data']
df_H_umap = df_compo_specific_features_conc_new_umap[df_compo_specific_features_conc_new_umap['dataset'] == 'hardness data']
df_new_FeCrNiMoTi_umap = df_compo_specific_features_conc_new_umap[df_compo_specific_features_conc_new_umap['dataset'] == 'new NiCrMoTiFe data']
df_new_FeCrNiCoV_umap = df_compo_specific_features_conc_new_umap[df_compo_specific_features_conc_new_umap['dataset'] == 'new NiCrCoVFe data']
# prepare the df for calculating Mahalanobis distance (cannot have string in df)
display(df_C_umap.iloc[[0, -1]], df_C_umap.shape,
df_H_umap.iloc[[0, -1]], df_H_umap.shape,
df_new_FeCrNiMoTi_umap.iloc[[0, -1]], df_new_FeCrNiMoTi_umap.shape,
df_new_FeCrNiCoV_umap.iloc[[0, -1]], df_new_FeCrNiCoV_umap.shape)
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | dataset | UMAP Component 1 | UMAP Component 2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.77 | 18.00 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.03 | 1.00 | 1.00 | ... | 1869.874935 | 173.330067 | -4.484751 | 4.974488 | 0.083927 | 7.719083 | 1.188494 | corrosion data | 2.649896 | -0.751512 |
| 711 | 0.16 | 0.23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.08 | 0.02 | ... | 936.061425 | 54.273452 | -0.350818 | 0.773645 | 0.049889 | 2.982996 | 0.264460 | corrosion data | 4.616013 | -2.950396 |
2 rows × 31 columns
(712, 31)
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | dataset | UMAP Component 1 | UMAP Component 2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 712 | 19.92 | 18.54 | 20.93 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19.59 | ... | 1801.194406 | 214.312458 | -4.1597 | 2.197442 | 0.138357 | 8.000092 | 1.414148 | hardness data | 11.596552 | -0.424313 |
| 1391 | 20.17 | 0.00 | 42.40 | 6.93 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | ... | 1852.264411 | 257.734769 | -6.8264 | 3.890390 | 0.103641 | 8.659625 | 1.614745 | hardness data | 4.837129 | 12.557776 |
2 rows × 31 columns
(680, 31)
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | dataset | UMAP Component 1 | UMAP Component 2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1392 | 14.316994 | 35.210247 | 45.453038 | 4.138939 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1943.689937 | 257.662231 | -6.600671 | 2.725195 | 0.129477 | 8.019565 | 1.882783 | new NiCrMoTiFe data | 3.048986 | 14.198552 |
| 1460 | 63.370312 | 5.076982 | 23.439965 | 4.034033 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1844.869218 | 192.175144 | -5.701689 | 4.064609 | 0.098628 | 8.101491 | 1.410874 | new NiCrMoTiFe data | 1.525560 | 16.811317 |
2 rows × 31 columns
(69, 31)
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | dataset | UMAP Component 1 | UMAP Component 2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1461 | 10.271486 | 31.290494 | 51.985556 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1898.775789 | 210.082957 | -6.271299 | 2.111357 | 0.115683 | 8.311398 | 1.854884 | new NiCrCoVFe data | 2.840408 | 12.974854 |
| 1529 | 51.347188 | 5.827793 | 31.584854 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1841.240483 | 148.982134 | -5.437576 | 2.782294 | 0.086038 | 8.238582 | 1.496411 | new NiCrCoVFe data | 3.166722 | 14.072919 |
2 rows × 31 columns
(69, 31)
Preparing the combined data sets (literature + new)
# Merge existing and new datasets, replacing missing values with 0
df_C_new_FeCrNiMoTi = pd.concat([df_C_umap, df_new_FeCrNiMoTi_umap], ignore_index=True).fillna(0)
df_C_new_FeCrNiCoV = pd.concat([df_C_umap, df_new_FeCrNiCoV_umap], ignore_index=True).fillna(0)
df_H_new_FeCrNiMoTi = pd.concat([df_H_umap, df_new_FeCrNiMoTi_umap], ignore_index=True).fillna(0)
df_H_new_FeCrNiCoV = pd.concat([df_H_umap, df_new_FeCrNiCoV_umap], ignore_index=True).fillna(0)
def process_df_Mahal(df):
df_num = df.loc[:, ~df.columns.isin(['dataset', 'UMAP Component 1', 'UMAP Component 2'])]
df_label = df[['dataset', 'UMAP Component 1', 'UMAP Component 2']]
# Detect zero columns using 'all' method after comparing with 0
zero_columns = (df_num == 0).all()
# Print the names of the columns containing only zeros with context
print("Columns only containing zeros:", zero_columns[zero_columns].index.tolist())
# Drop the zero columns using boolean indexing
df_num = df_num.loc[:, ~zero_columns]
return df_num, df_label
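The zero-column removal above is not cosmetic: a column that is identically zero has zero variance, which makes the covariance matrix singular, so the matrix inverse needed for the Mahalanobis distance does not exist. A minimal demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X_zero = np.hstack([X, np.zeros((50, 1))])  # same data plus an all-zero column

det_ok = np.linalg.det(np.cov(X.T))
det_bad = np.linalg.det(np.cov(X_zero.T))
print(f"det without zero column: {det_ok:.4f}")
print(f"det with zero column:    {det_bad:.4f}")  # zero determinant -> not invertible
```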
# Process each dataframe and separate numerical and label data
df_C_new_FeCrNiMoTi_num, df_C_new_FeCrNiMoTi_label = process_df_Mahal(df_C_new_FeCrNiMoTi)
df_C_new_FeCrNiCoV_num, df_C_new_FeCrNiCoV_label = process_df_Mahal(df_C_new_FeCrNiCoV)
df_H_new_FeCrNiMoTi_num, df_H_new_FeCrNiMoTi_label = process_df_Mahal(df_H_new_FeCrNiMoTi)
df_H_new_FeCrNiCoV_num, df_H_new_FeCrNiCoV_label = process_df_Mahal(df_H_new_FeCrNiCoV)
display(df_C_new_FeCrNiMoTi_num.iloc[[0,-1]], df_C_new_FeCrNiMoTi_label.iloc[[0,-1]])
Columns only containing zeros: ['Zr', 'Hf'] Columns only containing zeros: ['Zr', 'Hf'] Columns only containing zeros: ['N', 'Mg'] Columns only containing zeros: ['N', 'Mg']
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Mg | Y | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.770000 | 18.000000 | 10.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.03 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.019485 | 1869.874935 | 173.330067 | -4.484751 | 4.974488 | 0.083927 | 7.719083 | 1.188494 |
| 780 | 63.370312 | 5.076982 | 23.439965 | 4.034033 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.039801 | 1844.869218 | 192.175144 | -5.701689 | 4.064609 | 0.098628 | 8.101491 | 1.410874 |
2 rows × 26 columns
| dataset | UMAP Component 1 | UMAP Component 2 | |
|---|---|---|---|
| 0 | corrosion data | 2.649896 | -0.751512 |
| 780 | new NiCrMoTiFe data | 1.525560 | 16.811317 |
Calculate the Mahalanobis distance
df_C_new_FeCrNiMoTi_Mahl = df_C_new_FeCrNiMoTi_num.copy()
df_C_new_FeCrNiCoV_Mahl = df_C_new_FeCrNiCoV_num.copy()
df_H_new_FeCrNiMoTi_Mahl = df_H_new_FeCrNiMoTi_num.copy()
df_H_new_FeCrNiCoV_Mahl = df_H_new_FeCrNiCoV_num.copy()
# Function to calculate the squared Mahalanobis distance of each point to the "center" of a dataset
def calculateMahalanobis(y=None, data=None, cov=None):
    # y    : DataFrame/array of observations (p columns) whose distances are computed.
    # data : DataFrame of the reference distribution from which the centroid and covariance are taken.
    # cov  : (p x p) covariance matrix of the distribution; estimated from `data` if None.
    # Returns the SQUARED distance D^2 for each row of y, which is the quantity
    # compared against a chi-squared distribution later in this notebook.
    y_mu = y - np.mean(data, axis=0)
    if cov is None:
        cov = np.cov(data.values.T)
        cov += np.eye(cov.shape[0]) * 1e-6  # small jitter keeps the matrix invertible
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    mahal = np.dot(left, y_mu.T)  # note: forms an N x N matrix; only its diagonal is needed
    return mahal.diagonal()
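The dot-product formulation above builds a full N × N matrix and keeps only its diagonal. On a few thousand rows that is harmless, but the same squared distances can be obtained without the intermediate via `np.einsum`, and cross-checked against `scipy.spatial.distance.mahalanobis` (which returns D, not D²). A self-contained sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list('abcd'))

y_mu = (df - df.mean(axis=0)).values
cov = np.cov(df.values.T) + np.eye(4) * 1e-6  # same regularization as above
inv_cov = np.linalg.inv(cov)

# Diagonal of y_mu @ inv_cov @ y_mu.T, computed row-wise (no N x N intermediate)
d2 = np.einsum('ij,jk,ik->i', y_mu, inv_cov, y_mu)

# Cross-check one row against scipy, squaring its result
d0 = mahalanobis(df.values[0], df.values.mean(axis=0), inv_cov)
print(f"einsum D^2: {d2[0]:.6f}  scipy D^2: {d0**2:.6f}")
```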
# y: observations to score; data: the distribution from which distances are measured.
# Here the same combined frame is passed as both, so each point's distance is taken
# from the centroid of the combined (literature + new) distribution.
df_C_new_FeCrNiMoTi_Mahl['Mahalanobis'] = calculateMahalanobis(y = df_C_new_FeCrNiMoTi_Mahl, data= df_C_new_FeCrNiMoTi_num)
df_C_new_FeCrNiCoV_Mahl['Mahalanobis'] = calculateMahalanobis(y = df_C_new_FeCrNiCoV_Mahl, data= df_C_new_FeCrNiCoV_num)
df_H_new_FeCrNiMoTi_Mahl['Mahalanobis'] = calculateMahalanobis(y = df_H_new_FeCrNiMoTi_Mahl, data= df_H_new_FeCrNiMoTi_num)
df_H_new_FeCrNiCoV_Mahl['Mahalanobis'] = calculateMahalanobis(y = df_H_new_FeCrNiCoV_Mahl, data= df_H_new_FeCrNiCoV_num)
display(df_C_new_FeCrNiMoTi_Mahl.head(1))
display(df_C_new_FeCrNiCoV_Mahl.head(1))
display(df_H_new_FeCrNiMoTi_Mahl.head(1))
display(df_H_new_FeCrNiCoV_Mahl.head(1))
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Y | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | Mahalanobis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.77 | 18.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.03 | 1.0 | 1.0 | ... | 0.0 | 0.019485 | 1869.874935 | 173.330067 | -4.484751 | 4.974488 | 0.083927 | 7.719083 | 1.188494 | 4.557613 |
1 rows × 27 columns
| Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Y | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | Mahalanobis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.77 | 18.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.03 | 1.0 | 1.0 | ... | 0.0 | 0.019485 | 1869.874935 | 173.330067 | -4.484751 | 4.974488 | 0.083927 | 7.719083 | 1.188494 | 4.470198 |
1 rows × 27 columns
| Fe | Cr | Ni | Mo | W | Nb | C | Si | Mn | Cu | ... | Hf | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | Mahalanobis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.92 | 18.54 | 20.93 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19.59 | 0.0 | ... | 0.0 | 0.032693 | 1801.194406 | 214.312458 | -4.1597 | 2.197442 | 0.138357 | 8.000092 | 1.414148 | 9.258763 |
1 rows × 27 columns
| Fe | Cr | Ni | Mo | W | Nb | C | Si | Mn | Cu | ... | Hf | delta_a | Tm | sigma_Tm | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | Mahalanobis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.92 | 18.54 | 20.93 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19.59 | 0.0 | ... | 0.0 | 0.032693 | 1801.194406 | 214.312458 | -4.1597 | 2.197442 | 0.138357 | 8.000092 | 1.414148 | 9.317482 |
1 rows × 27 columns
# Define dataframes in a list
dfs = [df_C_new_FeCrNiMoTi_Mahl, df_C_new_FeCrNiCoV_Mahl, df_H_new_FeCrNiMoTi_Mahl, df_H_new_FeCrNiCoV_Mahl]
titles = ['df_C_new_FeCrNiMoTi_Mahl', 'df_C_new_FeCrNiCoV_Mahl', 'df_H_new_FeCrNiMoTi_Mahl', 'df_H_new_FeCrNiCoV_Mahl']
# Set up subplots
fig, axs = plt.subplots(2, 4, figsize=(20, 10))
fig.suptitle('Mahalanobis distance', fontsize=20)
# Iterate through dataframes
for i, df in enumerate(dfs):
# Plot histogram
axs[0,i].hist(df['Mahalanobis'], bins=100, edgecolor='black', alpha=0.5, density=True)
axs[0,i].set_xlabel('Mahalanobis distance')
    axs[0,i].set_ylabel('Density')
axs[0,i].set_xlim(0, 200)
axs[0,i].set_title(titles[i])
# Fit a chi-squared distribution to the data
df_param, loc, scale = stats.chi2.fit(df['Mahalanobis'])
# Plot the fitted distribution over the histogram
x_chi2 = np.linspace(0, np.amax(df['Mahalanobis']), 500)
pdf_chi2 = stats.chi2.pdf(x_chi2, df=df_param, loc=loc, scale=scale)
axs[0,i].plot(x_chi2, pdf_chi2, 'maroon')
# Add the fitted values to the plot
axs[0,i].text(0.7, 0.5, f"df = {df_param:.2f}\nloc = {loc:.2f}\nscale = {scale:.2f}",
transform=axs[0,i].transAxes, ha='left', va='center', fontsize=12)
# Plot CDF
counts, bins, patches = axs[1,i].hist(df['Mahalanobis'], bins=100, edgecolor='black', alpha=0.5, cumulative=True, density=True)
axs[1,i].plot(bins[:-1], counts, 'maroon', lw=2)
axs[1,i].set_xlim(0, 200)
axs[1,i].set_xlabel('Mahalanobis distance')
    axs[1,i].set_ylabel('Empirical CDF')
# Show the plot
plt.tight_layout(rect=[0, 0, 1, 0.96]) # Adjust layout to prevent overlapping
plt.show()
Use chi-squared statistics to get a more interpretable number: a p-value based on the chi-squared distribution with 2 degrees of freedom
# Calculate a p-value for each squared Mahalanobis distance using the chi-squared distribution (df = 2)
df_C_new_FeCrNiMoTi_Mahl['p value'] = 1 - chi2.cdf(df_C_new_FeCrNiMoTi_Mahl['Mahalanobis'], 2)
df_C_new_FeCrNiCoV_Mahl['p value'] = 1 - chi2.cdf(df_C_new_FeCrNiCoV_Mahl['Mahalanobis'], 2)
df_H_new_FeCrNiMoTi_Mahl['p value'] = 1 - chi2.cdf(df_H_new_FeCrNiMoTi_Mahl['Mahalanobis'], 2)
df_H_new_FeCrNiCoV_Mahl['p value'] = 1 - chi2.cdf(df_H_new_FeCrNiCoV_Mahl['Mahalanobis'], 2)
# re-attach the labels
df_C_new_FeCrNiMoTi_Mahl_label = pd.concat([df_C_new_FeCrNiMoTi_Mahl, df_C_new_FeCrNiMoTi_label], axis=1)
df_C_new_FeCrNiCoV_Mahl_label = pd.concat([df_C_new_FeCrNiCoV_Mahl, df_C_new_FeCrNiCoV_label], axis=1)
df_H_new_FeCrNiMoTi_Mahl_label = pd.concat([df_H_new_FeCrNiMoTi_Mahl, df_H_new_FeCrNiMoTi_label], axis=1)
df_H_new_FeCrNiCoV_Mahl_label = pd.concat([df_H_new_FeCrNiCoV_Mahl, df_H_new_FeCrNiCoV_label], axis=1)
display(df_C_new_FeCrNiMoTi_Mahl_label.iloc[[0,-1]])
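As a side note, `1 - chi2.cdf(x, 2)` loses precision and underflows to 0.0 for large distances sooner than necessary; `scipy.stats.chi2.sf` computes the same tail probability directly, and for df = 2 it reduces to exp(-x/2). A small equivalence sketch on synthetic distances:

```python
import numpy as np
from scipy.stats import chi2

x = np.array([1.0, 14.75, 50.0])  # synthetic Mahalanobis distances
p_cdf = 1 - chi2.cdf(x, 2)        # as in the cell above
p_sf = chi2.sf(x, 2)              # survival function: same tail, numerically stabler

print(np.allclose(p_cdf, p_sf))           # True
print(np.allclose(p_sf, np.exp(-x / 2)))  # True: chi2 with df = 2 is Exp(scale = 2)
```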
# plot the histogram of p-values for all the datasets
fig, axs = plt.subplots(2, 2, figsize=(8, 8))
fig.suptitle('p-value', fontsize=20)
axs[0,0].hist(df_C_new_FeCrNiMoTi_Mahl['p value'], bins=100, edgecolor='black', alpha=0.5, density=True)
axs[0,0].set_xlabel('p-value')
axs[0,0].set_ylabel('Frequency')
axs[0,1].hist(df_C_new_FeCrNiCoV_Mahl['p value'], bins=100, edgecolor='black', alpha=0.5, density=True)
axs[0,1].set_xlabel('p-value')
axs[0,1].set_ylabel('Frequency')
axs[1,0].hist(df_H_new_FeCrNiMoTi_Mahl['p value'], bins=100, edgecolor='black', alpha=0.5, density=True)
axs[1,0].set_xlabel('p-value')
axs[1,0].set_ylabel('Frequency')
axs[1,1].hist(df_H_new_FeCrNiCoV_Mahl['p value'], bins=100, edgecolor='black', alpha=0.5, density=True)
axs[1,1].set_xlabel('p-value')
axs[1,1].set_ylabel('Frequency')
plt.tight_layout() # Adjust layout to prevent overlapping
plt.show()
|   | Fe | Cr | Ni | Mo | W | N | Nb | C | Si | Mn | ... | Hmix | sigma_Hmix | sigma_elec_nega | VEC | sigma_VEC | Mahalanobis | p value | dataset | UMAP Component 1 | UMAP Component 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69.770000 | 18.000000 | 10.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.03 | 1.0 | 1.0 | ... | -4.484751 | 4.974488 | 0.083927 | 7.719083 | 1.188494 | 4.557613 | 0.102406 | corrosion data | 2.649896 | -0.751512 |
| 780 | 63.370312 | 5.076982 | 23.439965 | 4.034033 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | ... | -5.701689 | 4.064609 | 0.098628 | 8.101491 | 1.410874 | 14.750302 | 0.000627 | new NiCrMoTiFe data | 1.525560 | 16.811317 |
2 rows × 31 columns
If we take the p-value from the chi-squared statistic as a measure of "novelty" (smaller values are more likely outliers), we can map it back onto the 2D UMAP projection and onto the PVD wafer representation.
Mapping the chi-squared p-values onto the UMAP 2D projection shows that the variation of the p-value is NOT monotonic across this projection.
The "new" dataset is highlighted; the "training" (literature) data points remain visible as translucent markers.
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
def create_scatter(df, dataset_values, ax, cmap="RdBu", titlename="name?"):
    # Create separate dataframes for each dataset
    df_literature = df[df["dataset"] == dataset_values[0]]
    df_new = df[df["dataset"] == dataset_values[1]]
    # Create the scatter plots (literature data translucent, new data opaque)
    scatters = []
    for df_subset, alpha in zip([df_literature, df_new], [0.2, 0.9]):
        scatter = ax.scatter(df_subset["UMAP Component 1"], df_subset["UMAP Component 2"],
                             c=df_subset["p value"], cmap=cmap, edgecolor="grey",
                             s=500, marker='.', alpha=alpha, vmin=0, vmax=0.1)
        scatters.append(scatter)
    ax.set_xlabel("UMAP Component 1", fontsize=20)
    ax.set_ylabel("UMAP Component 2", fontsize=20)
    ax.tick_params(axis='both', which='major', labelsize=20)
    # Set the subplot title to the dataset name
    ax.set_title(titlename, fontsize=20, y=1.05)
    # Use an equal aspect ratio
    ax.set_aspect('equal', 'box')
    return scatters
# create the subplots of 2x2
fig, ax = plt.subplots(2, 2, figsize=(18, 15), dpi=150)
datasets = [(df_C_new_FeCrNiMoTi_Mahl_label, ['corrosion data', 'new NiCrMoTiFe data'], "RdBu", "new FeCrNiMoTi to corrosion dataset"),
(df_C_new_FeCrNiCoV_Mahl_label, ['corrosion data', 'new NiCrCoVFe data'], "RdBu", "new FeCrNiCoV to corrosion dataset"),
(df_H_new_FeCrNiMoTi_Mahl_label, ['hardness data', 'new NiCrMoTiFe data'], "RdYlBu", "new FeCrNiMoTi to hardness dataset"),
(df_H_new_FeCrNiCoV_Mahl_label, ['hardness data', 'new NiCrCoVFe data'], "RdYlBu", "new FeCrNiCoV to hardness dataset")]
# Get global min and max for UMAP components
xmin, xmax = min(df['UMAP Component 1'].min() for df, _, _, _ in datasets), max(df['UMAP Component 1'].max() for df, _, _, _ in datasets)
ymin, ymax = min(df['UMAP Component 2'].min() for df, _, _, _ in datasets), max(df['UMAP Component 2'].max() for df, _, _, _ in datasets)
for i, (df, dataset_values, cmap, title) in enumerate(datasets):
    axi = ax[i//2, i%2]  # Get the current axis
    scatters = create_scatter(df, dataset_values, axi, cmap=cmap, titlename=title)
    # Set shared axis limits
    axi.set_xlim(xmin, xmax)
    axi.set_ylim(ymin, ymax)
    # Create a divider for the existing axes instance
    divider = make_axes_locatable(axi)
    # Append axes for the colorbar to the right of axi, with 5% of its width
    cax = divider.append_axes("right", size="5%", pad=0.1)
    cbar = fig.colorbar(scatters[-1], cax=cax)
    cbar.set_label("p-value", size=20, labelpad=10)
    cbar.ax.tick_params(labelsize=20)
plt.tight_layout()
plt.savefig('UMAP 2D_Mahalanobis.png')
plt.show()
Now I will plot the p-values on the PVD wafer layout.
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import pandas as pd
PVD_x_y = pd.read_excel(data_path + 'PVD_x_y.xlsx')
def create_PVD_scatter(df, PVD_x_y, dataset_values, ax, cmap="RdBu", title=""):
    # Keep only the new dataset
    df_new = df[df["dataset"] == dataset_values[1]]
    # Create the scatter plot on the wafer (x, y) coordinates
    scatter_alpha = 1
    scatter_size = 2500
    scatter = ax.scatter(PVD_x_y["x"], PVD_x_y["y"], c=df_new["p value"], cmap=cmap, edgecolor="grey",
                         s=scatter_size, marker='.', alpha=scatter_alpha, vmin=0, vmax=0.05)
    # Label each wafer position with its 1-based index
    for i, txt in enumerate(PVD_x_y.index + 1):
        ax.annotate(txt, (PVD_x_y["x"].iloc[i] - 3, PVD_x_y["y"].iloc[i] - 1), color="grey", alpha=1, fontsize=12)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_title(title, fontsize=16, y=1.2)
    ax.set_aspect('equal', 'box')
    # Export df_new to Excel, named after the subplot title
    df_new.to_excel(title + ".xlsx")
    return scatter
fig, ax = plt.subplots(2, 2, figsize=(12, 12), dpi=150)
datasets = [(df_C_new_FeCrNiMoTi_Mahl_label, ['corrosion data', 'new NiCrMoTiFe data'], "RdBu", "new FeCrNiMoTi to corrosion dataset"),
(df_C_new_FeCrNiCoV_Mahl_label, ['corrosion data', 'new NiCrCoVFe data'], "RdBu", "new FeCrNiCoV to corrosion dataset"),
(df_H_new_FeCrNiMoTi_Mahl_label, ['hardness data', 'new NiCrMoTiFe data'], "RdYlBu", "new FeCrNiMoTi to hardness dataset"),
(df_H_new_FeCrNiCoV_Mahl_label, ['hardness data', 'new NiCrCoVFe data'], "RdYlBu", "new FeCrNiCoV to hardness dataset")]
for i, (df, dataset_values, cmap, title) in enumerate(datasets):
    axi = ax[i//2, i%2]
    scatter = create_PVD_scatter(df, PVD_x_y, dataset_values, axi, cmap=cmap, title=title)
    divider = make_axes_locatable(axi)
    cax = divider.append_axes("right", size="5%", pad=0.1)
    fig.colorbar(scatter, cax=cax)
plt.tight_layout()
plt.savefig('PVD 2D_Mahalanobis.png')
plt.show()
Other possibilities: I will also fix the levels of the color contours, because a rational way to select compositions may be to pick a few ranges: close -> intermediate -> far from the center of the literature data group.
z_C_new_FeCrNiMoTi_Mahl_label = df_C_new_FeCrNiMoTi_Mahl_label[df_C_new_FeCrNiMoTi_Mahl_label["dataset"] == 'new NiCrMoTiFe data']["p value"].to_numpy()
print(z_C_new_FeCrNiMoTi_Mahl_label)
[1.24125609e-03 1.12451677e-03 7.96168432e-04 3.74686917e-04 7.22274283e-05 2.05591068e-02 1.83056756e-02 1.56455839e-02 1.06293292e-02 3.53984425e-03 3.46507362e-04 9.12103625e-06 5.27446944e-02 7.75804829e-02 1.03717915e-01 7.21882457e-02 5.29583407e-02 1.02625674e-02 1.11932064e-03 9.76038047e-06 7.98019872e-10 5.45585649e-02 9.13466431e-02 1.39959889e-01 1.51222178e-01 9.67429326e-02 1.99134499e-02 1.70574931e-03 7.63433406e-06 9.25494126e-11 2.88585589e-02 6.44629866e-02 1.09926657e-01 1.08793784e-01 7.35906278e-02 2.11566347e-02 1.58500336e-03 9.24212276e-06 2.69801959e-11 1.80893424e-02 3.68962638e-02 5.28080222e-02 7.74750588e-02 5.93836575e-02 1.83327218e-02 1.36787910e-03 5.17356829e-06 4.71842232e-10 6.68993304e-03 1.72173533e-02 2.23123926e-02 2.88525146e-02 2.16714657e-02 9.96350297e-03 9.41950419e-04 2.70612306e-05 2.12702300e-10 6.31690422e-03 7.64492367e-03 1.14579142e-02 9.58049097e-03 5.83294296e-03 6.72030371e-04 1.21847079e-05 3.10783840e-03 3.17689738e-03 3.31989300e-03 2.27309567e-03 6.26632084e-04]
from matplotlib import cm
import matplotlib as mpl

fig, (ax1, ax2) = plt.subplots(figsize=(4, 3),
                               ncols=2, gridspec_kw={'width_ratios': [3, 0.1]}, dpi=150)
# Plot with fixed p-value intervals
cmap = cm.RdGy(np.linspace(0, 1, 4))  # RdBu/RdGy
c_0_01 = [i for i,v in enumerate(z_C_new_FeCrNiMoTi_Mahl_label) if v < 0.01]
c_0_05 = [i for i,v in enumerate(z_C_new_FeCrNiMoTi_Mahl_label) if v < 0.05 and v >= 0.01]
c_0_1 = [i for i,v in enumerate(z_C_new_FeCrNiMoTi_Mahl_label) if v < 0.1 and v >= 0.05]
c_abv_0_1 = [i for i,v in enumerate(z_C_new_FeCrNiMoTi_Mahl_label) if v >= 0.1]
cax1 = ax1.scatter(PVD_x_y["x"][c_0_01], PVD_x_y["y"][c_0_01], color=cmap[0], s=400, marker='.') # ,vmin=0.2, vmax=0.6
cax2 = ax1.scatter(PVD_x_y["x"][c_0_05], PVD_x_y["y"][c_0_05], color=cmap[1], s=400, marker='.') # ,vmin=0.2, vmax=0.6
cax3 = ax1.scatter(PVD_x_y["x"][c_0_1], PVD_x_y["y"][c_0_1], color=cmap[2], s=400, marker='.') # ,vmin=0.2, vmax=0.6
cax4 = ax1.scatter(PVD_x_y["x"][c_abv_0_1], PVD_x_y["y"][c_abv_0_1], color=cmap[3], s=400, marker='.') # ,vmin=0.2, vmax=0.6
ax1.set_aspect('equal', 'box')
for i, txt in enumerate(PVD_x_y.index + 1):
    ax1.annotate(txt, (PVD_x_y["x"].iloc[i] - 2, PVD_x_y["y"].iloc[i] - 1), color="grey", alpha=1, fontsize=8)
ax1.set_xlabel('X')
ax1.set_ylabel('Y')
# ax1.set_title(KW_name+'_'+PT_name)
# customised colorbar: https://matplotlib.org/stable/tutorials/colors/colorbar_only.html
cmap = mpl.cm.RdGy #RdBu/RdGy
bounds = [0, 1, 2, 3, 4]
norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
cbar = fig.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cmap),
                    cax=ax2, orientation='vertical',
                    label="p-value")
# Boundary ticks at 0, 1, 2, 3, 4 correspond to p-value thresholds 0, 0.01, 0.05, 0.1
cbar.ax.set_yticklabels(['0', '0.01', '0.05', '0.1', ''])
plt.show()
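The manual index lists above (`c_0_01`, `c_0_05`, ...) can be replaced by binning the p-values in a single step; a sketch using `np.digitize` with the same thresholds (synthetic p-values for illustration):

```python
import numpy as np

p = np.array([0.005, 0.03, 0.07, 0.2])  # synthetic p-values
bounds = [0.01, 0.05, 0.1]
# Bin index per point: 0 -> p < 0.01, 1 -> 0.01 <= p < 0.05,
# 2 -> 0.05 <= p < 0.1, 3 -> p >= 0.1
bin_idx = np.digitize(p, bounds)
print(bin_idx)  # [0 1 2 3]
```

The per-point colors then follow as `cm.RdGy(np.linspace(0, 1, 4))[bin_idx]`, so a single `scatter` call can replace the four separate ones.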